Academy for Engineering and Technology, Fudan University
Digital Medical Research Center, School of Basic Medical Sciences, Fudan University
Shanghai Key Laboratory of Medical Image Computing and Computer Assisted Intervention
Shandong Computer Science Center (National Supercomputer Center in Jinan)
Weakly supervised semantic segmentation (WSSS) with image-level labels aims to achieve segmentation tasks without dense annotations. However, owing to the frequent coupling of co-occurring objects and the limited supervision from image-level labels, the challenging co-occurrence problem is widespread and leads to false activation of objects in WSSS. In this work, we devise a ‘Separate and Conquer’ scheme SeCo to tackle this issue from the dimensions of image space and feature space. In the image space, we propose to ‘separate’ the co-occurring objects via image decomposition by subdividing images into patches. Importantly, we assign each patch a category tag from Class Activation Maps (CAMs), which spatially helps remove the co-context bias and guides the subsequent representation. In the feature space, we propose to ‘conquer’ the false activation by enhancing semantic representation with multi-granularity knowledge contrast. To this end, a dual-teacher single-student architecture is designed and tag-guided contrast is conducted, which guarantees the correctness of knowledge and further enlarges the discrepancy among co-contexts. We streamline the multi-staged WSSS pipeline end-to-end and tackle this issue without external supervision. Extensive experiments validate the efficiency of our method and its superiority over previous single-staged and even multi-staged competitors on PASCAL VOC and MS COCO. Code is available here.
Co-occurrence of objects is inevitable and often leads to false-positive pixels being activated with high probability, i.e., the model is confused by error-prone feature representations. To deal with this issue, a common practice is to introduce external supervision or human priors.
So, why not separate the coupled objects first and generate patches at the beginning?
Each patch then contains single-category information, and category-specific representation is subsequently enhanced with a dual-teacher single-student architecture.
overall architecture
Consider an input image \(\boldsymbol{I}\) containing \(K\) classes of objects \(\left\{Y_i\right\}\left(i = 1, 2, \cdots, K\right)\).
In the teacher’s \(\lambda\)th layer:
Construct auxiliary pseudo mask \(\boldsymbol{M}_\mathrm{aux}\) by:
\[\mathrm{CAM}_\mathrm{aux} = \mathrm{ReLU}\left(\boldsymbol{W}_\lambda^\mathsf{T}\boldsymbol{Z}_\mathrm{F}^\lambda\right).\]
Use \(\mathrm{CAM}_\mathrm{aux}\) to guide category tag allocation:
let \(\boldsymbol{m}_i = \mathrm{crop}\left(\boldsymbol{M}_\mathrm{aux}\right)\), and assign the category tag of \(\boldsymbol{m}_i\) to patch \(\boldsymbol{x}_i\).
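A minimal sketch of this tag allocation, assuming a regular grid crop and a dominant-class rule (the grid size, background threshold, and function name are illustrative, not the paper's exact procedure):

```python
import torch

def assign_patch_tags(cam_aux: torch.Tensor, grid: int = 2, bg_thresh: float = 0.25):
    """Assign one category tag per patch from the auxiliary CAM.

    cam_aux: (K, H, W) activation map for the K image-level classes.
    Returns a list of ((row, col), tag) pairs; the tag is the dominant class
    inside the cropped mask m_i, or -1 if nothing is activated strongly enough.
    """
    K, H, W = cam_aux.shape
    ph, pw = H // grid, W // grid
    tags = []
    for r in range(grid):
        for c in range(grid):
            m_i = cam_aux[:, r * ph:(r + 1) * ph, c * pw:(c + 1) * pw]  # crop(M_aux)
            score = m_i.flatten(1).mean(dim=1)               # mean activation per class
            tag = int(score.argmax()) if score.max() > bg_thresh else -1
            tags.append(((r, c), tag))
    return tags
```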
For the representation of local patches, a ViT is used to extract high-level semantics.
\(g_\mathrm{q}\) denotes the student encoder and \(g_\mathrm{k}\) the local teacher encoder. The class token of the ViT carries each patch's high-level semantics, and an MLP is then applied to strengthen the features.
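A minimal sketch of such an MLP head applied to the class token (the dimensions, layer choices, and normalization are assumptions rather than the paper's configuration):

```python
import torch
import torch.nn as nn

class ProjectionHead(nn.Module):
    """MLP projector applied to the ViT class token (used for both g_q and g_k)."""
    def __init__(self, in_dim: int = 768, hidden_dim: int = 2048, out_dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, cls_token: torch.Tensor) -> torch.Tensor:
        # Normalize so that downstream dot products act as cosine similarities.
        return nn.functional.normalize(self.mlp(cls_token), dim=-1)
```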
Decomposition spatially separates co-occurrences, but may destroy the semantic context of the patches.
Use a global teacher to extract knowledge from the entire image.
Share encoder between the global teacher and the student.
Instead of extracting semantics based on CAMs, the global teacher uses class tokens to represent high-level semantics and obtains the knowledge \(\boldsymbol{P}_l\left(l = 1, 2, \cdots, K\right)\) (i.e., class semantic centroids, also mentioned in the introduction, which help push apart co-contexts), avoiding the noise from the false localization of CAMs.
The self-attention mechanism gathers global semantics, avoiding the limitation of CAM-based methods, which are easily confused by co-occurrences when applied globally. Recall that the goal of “decomposition” is to “decouple” co-occurrences.
Note that the ViT gathers semantics within one specific image.
Example: image A yields a prototype \(a\) for class “boat” and image B yields another prototype \(b\) for the same class; the cosine similarity between \(a\) and \(b\) is computed, and a softmax over these similarity scores gives the weights \(W_l\).
Prototypes of the same class from different images thus contribute to the same global prototype.
Given the multi-class tokens \(\boldsymbol{Z}_l\) obtained from the global teacher encoder, the prototypes are updated by:
\[\boldsymbol{P}_l \leftarrow \mathrm{Norm}\left(\eta\; \boldsymbol{P}_l + \left(1 - \eta\right)\; W_l \cdot \boldsymbol{Z}_l\right).\]
This applies an exponential moving average over the existing knowledge and the weighted token, so the prototypes are updated dynamically during training.
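A sketch of this update rule, assuming the weights \(W_l\) come from a softmax over cosine similarities between the incoming class tokens and the current prototype (the temperature, per-class batching, and function name are my assumptions):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def update_prototypes(protos: torch.Tensor, class_tokens: torch.Tensor,
                      labels: torch.Tensor, eta: float = 0.9):
    """EMA update of global prototypes P_l with similarity-weighted class tokens.

    protos:       (C, D) global prototypes, one per dataset class.
    class_tokens: (B, D) multi-class tokens Z_l from the global teacher.
    labels:       (B,)   class index each token belongs to.
    """
    tokens = F.normalize(class_tokens, dim=-1)
    for l in labels.unique():
        z = tokens[labels == l]                        # tokens of class l in this batch
        sim = z @ F.normalize(protos[l], dim=-1)       # cosine similarity to current prototype
        w = F.softmax(sim / 0.1, dim=0).unsqueeze(1)   # weights W_l over the batch tokens
        new = eta * protos[l] + (1 - eta) * (w * z).sum(dim=0)
        protos[l] = F.normalize(new, dim=-1)           # Norm(...) in the update rule
    return protos
```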
An image is cropped into \(u\) patches. \(\boldsymbol{q}_i\) is the local feature extracted by the student, and \(\boldsymbol{P}_l^+\) is the positive prototype belonging to the same category as \(\boldsymbol{q}_i\).
\[\mathcal{L}_{\mathrm{LiG}} = - \frac{1}{N_\mathrm{g}^+}\;\sum_{i = 1}^{u} \log\;\frac{\exp\left(\frac{\boldsymbol{q}_i^\mathsf{T}\boldsymbol{P}_l^+}{\tau_\mathrm{g}}\right)}{\sum_{\boldsymbol{P}_l \in \boldsymbol{P}_{\mathrm{s}}}\exp\left(\frac{\boldsymbol{q}_i^\mathsf{T}\boldsymbol{P}_l}{\tau_\mathrm{g}}\right)}\]
Force \(\boldsymbol{q}_i\) to be close to its corresponding global prototype (centroid).
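A sketch of this local-to-global contrast as a prototype-classification form of InfoNCE (assuming each patch tag indexes a prototype row and untagged patches are skipped):

```python
import torch
import torch.nn.functional as F

def lig_loss(q: torch.Tensor, tags: torch.Tensor, protos: torch.Tensor,
             tau_g: float = 0.5):
    """Local-to-global contrast: pull each patch feature q_i toward the global
    prototype of its own category and away from all other prototypes.

    q:      (u, D) patch features from the student.
    tags:   (u,)   category tag per patch; -1 marks patches to ignore.
    protos: (C, D) global class prototypes P_s.
    """
    valid = tags >= 0
    if valid.sum() == 0:
        return q.new_zeros(())
    q = F.normalize(q[valid], dim=-1)
    p = F.normalize(protos, dim=-1)
    logits = q @ p.t() / tau_g                           # similarities to every prototype
    return F.cross_entropy(logits, tags[valid].long())   # softmax over prototypes, positive = own class
```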
A category tag pool is proposed to match the memory bank.
\[B_\mathrm{q}, B_\mathrm{k} = \left(\overbrace{\boldsymbol{x}}^\text{query}, \overbrace{t}^\text{key embeddings}\right).\]
Use a queue to capture chronological information: enqueue the newest \(B_\mathrm{q}, B_\mathrm{k}\) and dequeue the oldest \(B_\mathrm{-q}, B_\mathrm{-k}\). The local teacher is updated from the student with EMA to keep the memories consistent for contrast and to avoid dramatic variance between the older memories and the newest ones in the reservoir.
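A minimal sketch of a MoCo-style memory bank paired with such a tag pool (the bank size, feature dimension, and class interface are assumptions):

```python
import torch
import torch.nn.functional as F

class TagMemoryBank:
    """FIFO reservoir of key embeddings with a matching category tag pool."""
    def __init__(self, size: int = 4096, dim: int = 256):
        self.keys = F.normalize(torch.randn(size, dim), dim=-1)
        self.tags = torch.full((size,), -1, dtype=torch.long)   # tag pool, -1 = empty / noisy
        self.ptr = 0

    @torch.no_grad()
    def enqueue(self, new_keys: torch.Tensor, new_tags: torch.Tensor):
        # Enqueue the newest batch; the oldest entries are overwritten (dequeued).
        n = new_keys.shape[0]
        idx = (self.ptr + torch.arange(n)) % self.keys.shape[0]
        self.keys[idx] = new_keys
        self.tags[idx] = new_tags
        self.ptr = int((self.ptr + n) % self.keys.shape[0])

    def positives(self, tag: int) -> torch.Tensor:
        """All stored keys sharing the query's category tag, i.e. R(x, t)_+."""
        return self.keys[self.tags == tag]
```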
A similarity-based rectification strategy is used to denoise the tags.
Measure the similarity between \(\boldsymbol{q}_i\) and its history embeddings:
\[\mu\left(\boldsymbol{q}_i, t_i\right) = \frac{1}{\left|R\left(\boldsymbol{x}, t_i\right)_+\right|}\sum_{\boldsymbol{k}_+ \in R\left(\boldsymbol{x}, t_i\right)_+} \boldsymbol{q}_i^\mathsf{T}\boldsymbol{k}_+.\]
The similarity between two patches of the same category should be significantly higher than that between patches of different categories.
Once the proportion of abnormal similarity pairs exceeds a threshold \(\sigma\), \(\boldsymbol{q}_i\) is considered a noisy embedding.
If \(\frac{1}{\left|R\left(\boldsymbol{x}, t_i\right)_+\right|}\sum_{\boldsymbol{k}_+ \in R\left(\boldsymbol{x}, t_i\right)_+} \mathbb{1}\left(\boldsymbol{q}_i^\mathsf{T}\boldsymbol{k}_+ \lt \mu\left(\boldsymbol{q}_i, t_i\right)\right) \gt \sigma\), then \(t_i \leftarrow -1\).
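A sketch of this rectification rule exactly as written above (the bank interface, argument names, and default \(\sigma\) are assumptions):

```python
import torch

def rectify_tag(q_i: torch.Tensor, tag_i: int, bank_keys: torch.Tensor,
                bank_tags: torch.Tensor, sigma: float = 0.5) -> int:
    """If q_i is dissimilar to too many same-tag history embeddings, mark it noisy (-1)."""
    pos = bank_keys[bank_tags == tag_i]          # R(x, t_i)_+ : same-tag history embeddings
    if pos.shape[0] == 0:
        return tag_i
    sims = pos @ q_i                             # q_i^T k_+ for every positive key
    mu = sims.mean()                             # mean similarity mu(q_i, t_i)
    abnormal = (sims < mu).float().mean()        # proportion of below-average pairs
    return -1 if abnormal > sigma else tag_i
```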
Recall that each patch has its category tag (noisy tags already rectified).
Patch-level co-category differentiation:
\[\mathcal{L}_\mathrm{LiL} = - \frac{1}{N_l^+}\sum_{i=1}^n\sum_{\boldsymbol{k}_+}M_f\log\frac{\exp\left(\frac{\boldsymbol{q}_i^\mathsf{T}\boldsymbol{k}_+}{\tau_l}\right)}{\sum_{\boldsymbol{k}^\prime \in R \left(\boldsymbol{x}, t\right)}\exp\left(\frac{\boldsymbol{q}_i^\mathsf{T}\boldsymbol{k}^\prime}{\tau_l}\right)}\]
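A sketch of this local-to-local contrast, assuming \(M_f\) masks out embeddings whose tags were rectified to \(-1\) (written as a loop for clarity; a vectorized version would be used in practice):

```python
import torch
import torch.nn.functional as F

def lil_loss(q: torch.Tensor, tags: torch.Tensor, bank_keys: torch.Tensor,
             bank_tags: torch.Tensor, tau_l: float = 0.3):
    """Local-to-local contrast: pull each patch feature toward stored keys with the
    same tag and push it away from keys of other (co-occurring) categories.
    """
    q = F.normalize(q, dim=-1)
    k = F.normalize(bank_keys, dim=-1)
    loss, n_pos = q.new_zeros(()), 0
    for i in range(q.shape[0]):
        if tags[i] < 0:                                  # M_f: skip noisy embeddings
            continue
        logits = q[i] @ k.t() / tau_l                    # similarities to every key in R(x, t)
        log_prob = logits - torch.logsumexp(logits, dim=0)
        pos = bank_tags == tags[i]                       # positives share the category tag
        if pos.any():
            loss = loss - log_prob[pos].sum()
            n_pos += int(pos.sum())
    return loss / max(n_pos, 1)                          # 1 / N_l^+ normalization
```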
Loss for SeCo:
\[\mathcal{L}_\mathrm{SeCo} = \mathcal{L}_\mathrm{cls} + \mathcal{L}_\mathrm{cls}^\mathrm{aux} + \alpha\mathcal{L}_\mathrm{LiG} + \beta\mathcal{L}_\mathrm{LiL}.\]
For the overall loss, add the segmentation loss: \[\mathcal{L} = \mathcal{L}_\mathrm{SeCo} + \gamma\mathcal{L}_\mathrm{seg}.\]
comparison with SOTAs
ablation study
comparison with other recent methods
performance
Patches are of fixed size; an adaptive patch-sizing strategy might do better.